Virtual Environments with PyTorch
=================================

Virtual environments are essential for isolating project dependencies and ensuring compatibility across different projects. This guide explains how to create a virtual environment using Python's built-in `venv` module.

Prerequisites
-------------

- Python 3.3 or newer
- A terminal or command prompt to execute commands

.. note::
   Some parts of this tutorial require the use of Git, which is pre-installed on TACC systems.

Steps to Create and Activate a Virtual Environment
--------------------------------------------------

**Step 1. Connect to our systems.**

If you are unfamiliar with how to do this, please review the section called **Connecting to TACC**.

Once connected to TACC systems, we can create the virtual environment.

.. note::
   It is best practice to host the environment in the $WORK directory, since the $SCRATCH directory is regularly purged and $HOME does not have the storage space for ML tasks.

**Step 3. Create the Virtual Environment**

Run this command to create a virtual environment. You can replace `myenv` with whatever you want to name your virtual environment.

::

    python3 -m venv myenv

**Step 4. Verify the Creation**

After running the command, a new directory (e.g., `myenv`) will be created in your current location. This directory contains the files needed for the virtual environment.

::

    (base) UserName@System myenv % ls
    bin        include    lib        pyvenv.cfg

Understanding the Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The virtual environment directory contains:

- **`bin` (or `Scripts` on Windows)**: Contains the executables, including the Python interpreter and the activation scripts.
- **`lib`**: Holds the `site-packages` directory, where packages installed into the virtual environment are stored.
- **`pyvenv.cfg`**: Configuration file for the virtual environment.

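The layout described above can also be inspected from Python itself. The following is a minimal sketch rather than part of the tutorial's workflow: the helper name ``create_and_list`` and the temporary directory are hypothetical, and pip is deliberately skipped to keep the example fast (``python3 -m venv`` installs pip by default). The exact entries can vary slightly by platform and Python version.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_and_list(env_dir):
    """Create a virtual environment at env_dir and return its top-level entries."""
    # with_pip=False skips bootstrapping pip (an assumption made for speed).
    # symlinks mirrors the command-line default: symlinks except on Windows.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    return sorted(p.name for p in Path(env_dir).iterdir())


# Build a throwaway environment so nothing is left behind in $WORK.
with tempfile.TemporaryDirectory() as tmp:
    entries = create_and_list(Path(tmp) / "myenv")
    print(entries)
```

On Linux systems the printed list should include the ``bin``, ``include``, and ``lib`` directories together with ``pyvenv.cfg``.
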

**Step 5. Activate the Environment**

::

    source myenv/bin/activate

Upon activation, the name of your environment appears in parentheses at the start of your prompt:

::

    (myenv) login3.frontera(470)$

In the next section, we will test this virtual environment by installing PyTorch into it and then running an example script.

Testing our Virtual Environment with multigpu_torchrun.py
---------------------------------------------------------

To demonstrate how to use our virtual environment, we will download the multigpu_torchrun.py script from a GitHub repository, install PyTorch, and then run an example benchmarking function from the script, all within our virtual environment.

**multigpu_torchrun.py** is a script from the official PyTorch examples repository that uses distributed data parallel (DDP) to split ML training tasks across GPUs, which shortens the runtime.

The multigpu_torchrun.py script can be found in the GitHub repository below:

https://github.com/pytorch/examples

**Step 6. Download the repository containing code to run**

You can download a GitHub repository through the command line with the command **git clone**.

::

    git clone https://github.com/pytorch/examples.git

**Step 7. Request a Node through idev**

To run our example script, we'll need to allocate a single node. One node on Frontera has 4 GPUs, which is adequate to run multigpu_torchrun.py's benchmarking function.

Begin your `idev `_ session by running the following in your virtual environment:

::

    idev -N 1 -n 1 -p rtx-dev -t 02:00:00

This will request a **single compute node (-N 1)** running a **single task (-n 1)** in the **rtx-dev** partition/queue **(-p)** for a time length of **two hours (-t 02:00:00).** The rtx-dev queue is specifically for the NVIDIA RTX-5000 GPU compute nodes on Frontera, which support CUDA and therefore PyTorch.
To determine the queues and hardware specifications of TACC's HPC systems, see our `website `_ for more information.

When you request a node through idev, you will be taken to a loading screen. After your idev session starts, your prompt will look like:

::

    (myenv) c196-011[rtx](452)$

This is how you will know your idev session has begun. **Ensure you see the (myenv) tag at the start of your prompt. If you do not, activate your virtual environment again.**

**Step 8. Install PyTorch into our Virtual Environment**

To run multigpu_torchrun.py, we will need to install PyTorch. Run the following pip command inside of your virtual environment:

::

    pip3 install torch torchvision torchaudio

**Step 9. cd into the ddp-tutorial-series folder**

We should now see a new directory called **examples** in the directory where we ran **git clone**. **cd** into the following directory:

::

    cd examples/distributed/ddp-tutorial-series

*This will be a hidden directory.*

**Step 10. Run multigpu_torchrun.py**

Within our virtual environment, we will use the **torchrun** command to launch the training script across all of the available nodes (1).

::

    torchrun --standalone --nproc_per_node=gpu multigpu_torchrun.py 5 10

This will distribute the training workload across all GPUs on your machine using `torch.distributed` and `DistributedDataParallel` (DDP). The two positional arguments tell the script to train the model for 5 epochs and to use a checkpoint interval of 10 epochs.

When run successfully, you should get a result like this:

.. image:: images/multigpu_result.png
   :alt: multigpu_result

.. note::
   The task may take a few minutes to run.

Congratulations! You have now run a successful multi-GPU training task in a virtual Python environment.

Deactivating a Virtual Environment
----------------------------------

When you’re done working in your virtual environment, you can deactivate it to return to the global Python environment:

1. Run the following command in your terminal (works on all operating systems):

   ::

       deactivate

2. You’ll notice the environment name disappears from your command line, confirming the environment has been deactivated.

Troubleshooting
---------------

- If the `activate` command is not recognized, ensure you’re in the directory where the virtual environment was created.

Congratulations! You now know how to activate, deactivate, and run code in a virtual environment to keep your Python projects organized and conflict-free.
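
If it is unclear whether the environment is active (for example, after an idev session starts or after running ``deactivate``), the interpreter can be asked directly. The following is a minimal sketch under stated assumptions: the helper names ``in_virtualenv`` and ``torch_available`` are hypothetical, and the check relies on the standard behavior that ``sys.prefix`` differs from ``sys.base_prefix`` inside an environment created with ``venv``.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def torch_available():
    """Return True if a package named torch can be found on the import path."""
    # find_spec locates the package without importing it.
    return importlib.util.find_spec("torch") is not None


print("virtual environment active:", in_virtualenv())
print("environment prefix:", sys.prefix)
print("torch importable:", torch_available())
```

Running this with ``python3`` before and after activation should show the first value change; the last line indicates whether the ``pip3 install`` step was applied to the interpreter currently in use.
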